Multilingual Wikipedia, Summarization, and Information Trustworthiness

Author

  • Elena Filatova
Abstract

Wikipedia is used as a corpus for a variety of text processing applications. It is especially popular for information selection tasks, such as summarization feature identification, answer generation/verification, etc. Many Wikipedia entries (about people, events, locations, etc.) have descriptions in several languages. The descriptions of one entry created in different languages often differ in length and content. In this paper we show that the pattern of information overlap across the descriptions written in different languages for the same Wikipedia entry fits the pyramid summary framework well, i.e., some facts are covered in the descriptions in many languages, while others are covered in only a handful of descriptions. This phenomenon leads to a natural summarization algorithm, which we present in this paper. According to our evaluation, the generated summaries achieve a high level of user satisfaction. Moreover, the discovered pyramid structure of Wikipedia entry descriptions can be used to verify the trustworthiness of Wikipedia information.
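
The pyramid idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, which this page does not describe. It assumes the facts in each language version have already been extracted and aligned across languages, which is the hard part in practice. The function name `pyramid_summary`, the `min_languages` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def pyramid_summary(facts_by_language, min_languages=2):
    """Rank facts by how many language versions mention them.

    facts_by_language maps a language code to the set of (already aligned)
    facts found in that language's description. This is an assumed input
    format, not one taken from the paper.
    """
    counts = Counter()
    for facts in facts_by_language.values():
        # Count each fact at most once per language version.
        counts.update(set(facts))
    # Most widely covered facts first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Keep only the facts at or above the chosen pyramid level.
    return [(fact, n) for fact, n in ranked if n >= min_languages]


# Hypothetical toy data: three language versions of one entry.
facts_by_language = {
    "en": {"born_1950", "won_award", "lives_in_paris"},
    "de": {"born_1950", "won_award"},
    "ru": {"born_1950"},
}

summary = pyramid_summary(facts_by_language)
for fact, n in summary:
    print(f"{fact}: covered in {n} language versions")
```

Under the same assumptions, a fact that appears in only one language version sits at the bottom of the pyramid. Such facts are candidates for a closer trustworthiness check rather than for inclusion in a summary.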

Similar resources

ACL 2013 MultiLing Pilot Overview

The 2013 Association for Computational Linguistics MultiLing Pilot posed a task to measure the performance of multilingual, single-document summarization systems using a dataset derived from many Wikipedias. The objective of the pilot was to assess automatic summarization of multilingual text documents outside the news domain and the potential of using Wikipedia articles for such research. Thi...

IIIT Hyderabad in Summarization and Knowledge Base Population at TAC 2011

In this report, we present details about the participation of IIIT Hyderabad in the Guided Summarization and Knowledge Base Population tracks at TAC 2011. We have enhanced our summarization system with knowledge-based measures. Wikipedia-based extraction methods and topic modelling are used to score sentences in the guided summarization track. For the multilingual summarization task, we investigated the HA...

Multilingual Summarization: Dimensionality Reduction and a Step Towards Optimal Term Coverage

In this paper we present three term weighting approaches for multi-lingual document summarization and give results on the DUC 2002 data as well as on the 2013 Multilingual Wikipedia feature articles data set. We introduce a new interval-bounded nonnegative matrix factorization. We use this new method, latent semantic analysis (LSA), and latent Dirichlet allocation (LDA) to give three term-weight...

NII at the 2006 Multilingual Summarization Evaluation

In this paper I detail the implementation of an extraction-based summarization system that uses sentence clustering and named entity identification as its main features for the 2006 Multilingual Summarization Evaluation. I discuss some of the failings of my system and what can be done to improve it.

Directions for Exploiting Asymmetries in Multilingual Wikipedia

Multilingual Wikipedia has been used extensively for a variety of Natural Language Processing (NLP) tasks. Many Wikipedia entries (people, locations, events, etc.) have descriptions in several languages. These descriptions, however, are not identical. On the contrary, descriptions in different languages created for the same Wikipedia entry can vary greatly in terms of description length and inform...

Publication date: 2009